Introduction

Liver tissue and Kupffer cells from healthy patients and patients with liver cirrhosis were analysed. Both the cell types included 4 samples from patients with cirrhosis (cases) and 5 controls samples.

Data collection (skip if you already have the DGElist file)

# Download data
gse <- getGEO("GSE123661")
## Found 1 file(s)
## GSE123661_series_matrix.txt.gz
## 
## -- Column specification --------------------------------------------------------
## cols(
##   ID_REF = col_character(),
##   GSM3509143 = col_character(),
##   GSM3509144 = col_character(),
##   GSM3509145 = col_character(),
##   GSM3509146 = col_character(),
##   GSM3509147 = col_character(),
##   GSM3509148 = col_character(),
##   GSM3509149 = col_character(),
##   GSM3509150 = col_character(),
##   GSM3509151 = col_character(),
##   GSM3509152 = col_character(),
##   GSM3509153 = col_character(),
##   GSM3509154 = col_character(),
##   GSM3509155 = col_character(),
##   GSM3509156 = col_character(),
##   GSM3509157 = col_character(),
##   GSM3509158 = col_character(),
##   GSM3509159 = col_character(),
##   GSM3509160 = col_character()
## )
## File stored at:
## C:\Users\oryew\AppData\Local\Temp\Rtmpc5YQu3/GPL17303.soft
# Save experimental data
gsepheno <- pData(gse$GSE123661_series_matrix.txt.gz)
# Download supplementary files with raw count data
getGEOSuppFiles("GSE123661")
##                                                                                                                                  size
## C:/Users/oryew/Documents/UGent/J1-S2/Genome Analysis/Groepswerk/R_RNA_Seq_Data_Analysis_GSE123661/GSE123661/GSE123661_RAW.tar 4741120
##                                                                                                                               isdir
## C:/Users/oryew/Documents/UGent/J1-S2/Genome Analysis/Groepswerk/R_RNA_Seq_Data_Analysis_GSE123661/GSE123661/GSE123661_RAW.tar FALSE
##                                                                                                                               mode
## C:/Users/oryew/Documents/UGent/J1-S2/Genome Analysis/Groepswerk/R_RNA_Seq_Data_Analysis_GSE123661/GSE123661/GSE123661_RAW.tar  666
##                                                                                                                                             mtime
## C:/Users/oryew/Documents/UGent/J1-S2/Genome Analysis/Groepswerk/R_RNA_Seq_Data_Analysis_GSE123661/GSE123661/GSE123661_RAW.tar 2021-05-07 19:01:48
##                                                                                                                                             ctime
## C:/Users/oryew/Documents/UGent/J1-S2/Genome Analysis/Groepswerk/R_RNA_Seq_Data_Analysis_GSE123661/GSE123661/GSE123661_RAW.tar 2021-04-04 20:14:23
##                                                                                                                                             atime
## C:/Users/oryew/Documents/UGent/J1-S2/Genome Analysis/Groepswerk/R_RNA_Seq_Data_Analysis_GSE123661/GSE123661/GSE123661_RAW.tar 2021-05-07 19:01:48
##                                                                                                                               exe
## C:/Users/oryew/Documents/UGent/J1-S2/Genome Analysis/Groepswerk/R_RNA_Seq_Data_Analysis_GSE123661/GSE123661/GSE123661_RAW.tar  no
# List the file names
files <- untar("./GSE123661/GSE123661_RAW.tar", list =T)
# Unzip tar file
untar("./GSE123661/GSE123661_RAW.tar")
# Read first five lines from the first file
read.delim(files[1], nrow = 5)
##   Chromosome Start Stop Strand Gene.Symbol count
## 1 ERCC-00031   555  626      +  ERCC-00031     3
## 2 ERCC-00042   490  567      +  ERCC-00042     1
## 3 ERCC-00069   550  623      +  ERCC-00069     0
## 4 ERCC-00084   491  584      +  ERCC-00084     1
## 5 ERCC-00097   217  301      +  ERCC-00097     4
#consolidate count data (count data contains duplicate rows)

filenames <- substring(files, 0, nchar(files)-3)
dir.create("GSE123661_noDup")
## Warning in dir.create("GSE123661_noDup"): 'GSE123661_noDup' already exists
for (i in 1:length(files)){
  temp_df <- read.delim(files[i], stringsAsFactors = FALSE)
  temp_df.noDup <-aggregate(count~Gene.Symbol,data=temp_df,FUN=sum)
  write.table(temp_df.noDup, file = paste0("GSE123661_noDup/",filenames[i]), append = FALSE, sep = "\t", dec =".", row.names = FALSE, col.names = TRUE)
}

files.liver <- filenames[1:9]
files.kupffer <- filenames[10:18]

# Create DGElist obect (geneID & count)
dge.liver <- readDGE(paste0("GSE123661_noDup/",files.liver), columns=c(1,2))
dge.kupffer <- readDGE(paste0("GSE123661_noDup/",files.kupffer), columns=c(1,2))
str(dge.kupffer)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
##   ..@ .Data:List of 2
##   .. ..$ :'data.frame':  9 obs. of  4 variables:
##   .. .. ..$ files       : chr [1:9] "GSE123661_noDup/GSM3509152_Counts_MFK11.txt" "GSE123661_noDup/GSM3509153_Counts_MFK12.txt" "GSE123661_noDup/GSM3509154_Counts_MFK13.txt" "GSE123661_noDup/GSM3509155_Counts_MFK14.txt" ...
##   .. .. ..$ group       : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1
##   .. .. ..$ lib.size    : num [1:9] 13838531 12176771 9216626 7892337 17399266 ...
##   .. .. ..$ norm.factors: num [1:9] 1 1 1 1 1 1 1 1 1
##   .. ..$ : num [1:20827, 1:9] 75 53 872 822 11 41 21 6 807 53 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ Tags   : chr [1:20827] "A1BG" "A1CF" "A2M" "A2M-AS1" ...
##   .. .. .. ..$ Samples: chr [1:9] "GSE123661_noDup/GSM3509152_Counts_MFK11" "GSE123661_noDup/GSM3509153_Counts_MFK12" "GSE123661_noDup/GSM3509154_Counts_MFK13" "GSE123661_noDup/GSM3509155_Counts_MFK14" ...
# Add short samplenames
samplenames.liver <- paste0("L-", substring(colnames(dge.liver), 35,nchar(colnames(dge.liver))-7))
samplenames.kupffer <- paste0("K-", substring(colnames(dge.kupffer), 35,nchar(colnames(dge.kupffer))))

colnames(dge.liver)<-samplenames.liver
colnames(dge.kupffer)<-samplenames.kupffer

# Also create two seperate phenodata objects
gsepheno.liver <- gsepheno[which(gsepheno$source_name_ch1 %in% "Liver tissue"),]
gsepheno.kuppfer <- gsepheno[which(gsepheno$source_name_ch1 %in% "Macrophage"),]

# Add disease information to DGElist
dge.liver$samples$group <- gsub("disease state: ", "", as.character(gsepheno.liver$characteristics_ch1.1))
dge.kupffer$samples$group <- gsub("disease state: ", "", as.character(gsepheno.kuppfer$characteristics_ch1.1))

# Save DGElists
saveRDS(dge.liver, file='RawDGEList_GSE123661_Liver.rds')
saveRDS(dge.kupffer, file='RawDGEList_GSE123661_Kupffer.rds')
# Annotation (optional, our data already contained the gene symbol)
symbol <- rownames(dge.liver)
genes <- AnnotationDbi::select(EnsDb.Hsapiens.v75,keys=symbol, columns=c("ENTREZID","GENENAME"), keytype="SYMBOL")
columns(EnsDb.Hsapiens.v75)
##  [1] "ENTREZID"            "EXONID"              "EXONIDX"            
##  [4] "EXONSEQEND"          "EXONSEQSTART"        "GENEBIOTYPE"        
##  [7] "GENEID"              "GENENAME"            "GENESEQEND"         
## [10] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"         
## [13] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"    
## [16] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"    
## [19] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"            
## [22] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"          
## [25] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"               
## [28] "TXNAME"              "TXSEQEND"            "TXSEQSTART"         
## [31] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE"
genes<-genes[!duplicated(genes$ENTREZID),]
dge.liver$genes <- genes
dge.kupffer$genes <- genes

Data exploration (start here if DGElist objects are available)

dge.liver <- readRDS('RawDGEList_GSE123661_Liver.rds')
dge.kupffer <- readRDS('RawDGEList_GSE123661_Kupffer.rds')
# Keep copy of the raw counts
liver.raw <- dge.liver
kupffer.raw <- dge.kupffer

Plot the raw library sizes.

#liver
liver.raw$samples["sample"] <- colnames(dge.liver)
ggplot(data=liver.raw$samples, aes(x = sample, y = lib.size)) + 
  geom_bar(stat="identity", aes(fill = group))+
  labs(title = "Raw Library Sizes; Liver", x = "Sample", y = "Library Size")

#kupffer
kupffer.raw$samples["sample"] <- colnames(dge.kupffer)
ggplot(data=kupffer.raw$samples, aes(x = sample, y = lib.size)) + 
  geom_bar(stat="identity", aes(fill = group))+
  labs(title = "Raw Library Sizes; Kupffer", x = "Sample", y = "Library Size")

Preprocessing

Counts per Million (CPM)

# Converting to CPM
cpm.liver <- cpm(dge.liver) 
cpm.kupffer <- cpm(dge.kupffer)
head(cpm.liver[, 1:4])
##          Samples
## Tags           L-MFK11      L-MFK12      L-MFK13     L-MFK14
##   A1BG    1646.1592956 1864.0821326 2412.2141966 1588.362344
##   A1CF     139.5365222   88.3067456   60.4760210  101.674728
##   A2M     1891.7823573 1648.6175295 1696.0592443 1781.685721
##   A2M-AS1    2.8278971    2.4583210    1.4456419    1.156855
##   A2ML1      0.0000000    0.1446071    0.1606269    0.257079
##   A2MP1      0.8079706    0.3374166    0.0000000    0.000000
# Calculate log counts per million values
lcpm.liver <- cpm(dge.liver, log=TRUE)
lcpm.kupffer <- cpm(dge.kupffer, log=TRUE)
head(lcpm.liver[, 1:4])
##          Samples
## Tags          L-MFK11   L-MFK12    L-MFK13    L-MFK14
##   A1BG    10.68503234 10.864377 11.2362405 10.6334737
##   A1CF     7.12620320  6.467144  5.9222211  6.6701558
##   A2M     10.88565574 10.687185 10.7281107 10.7991603
##   A2M-AS1  1.58153052  1.391381  0.6876136  0.4025374
##   A2ML1   -2.59967432 -1.691603 -1.6188161 -1.2444984
##   A2MP1   -0.03956827 -0.993114 -2.5996743 -2.5996743

Raw data filtering

Next, we will filter out genes with low cpm values. Genes that are not expressed in any sample will eventually not result in differential expression. Additionally, genes with low or zero cpm values in most of the samples are unlikely to result in differential expression. It is best to remove these genes as these genes will also influence the outcome for multiple testing (= more power after removing them).

Determine how many genes show no expression across all samples?

table(rowSums(dge.liver$counts==0)==9) #9 samples (4 cases, 5 controls) #20816 FALSE; 11 TRUE
## 
## FALSE  TRUE 
## 20816    11
table(rowSums(dge.kupffer$counts==0)==9) #9 samples (4 cases, 5 controls) #20825 FALSE; 2 true
## 
## FALSE  TRUE 
## 20825     2

Remove all genes that do not have a cpm value of at least 0.5 in at least three samples: (A cpm value of 0.5 corresponds to 10 reads for a library size of 20.000.000.)

# in total 20827 genes (interesting + uninteresting genes)
liver.keep.exprs3 <- filterByExpr(dge.liver, group=dge.liver$sampples$group) #smallest group is 4
sum(liver.keep.exprs3) #15051 genes left
## [1] 15051
kupffer.keep.exprs3 <- filterByExpr(dge.kupffer, group=dge.kupffer$sampples$group) #smallest group is 4
sum(kupffer.keep.exprs3) #16996 genes left
## [1] 16996
dge.liver <- dge.liver[liver.keep.exprs3,, keep.lib.sizes=FALSE]
dge.kupffer <- dge.kupffer[kupffer.keep.exprs3,, keep.lib.sizes=FALSE]

Let’s compare the raw data with the filtered data using density plots:

#liver
lcpm.raw.counts.L <- cpm(liver.raw, log=TRUE, normalized.lib.sizes = FALSE)
lcpm.raw.counts.plot.L <- melt(as.data.frame(lcpm.raw.counts.L)) %>% 
  rename("Sample" = variable, "Expression" = value) 
## No id variables; using all as measure variables
ggplot(lcpm.raw.counts.plot.L, aes(x = Expression, colour = Sample)) + 
  geom_line(stat = 'density')+
  labs(title = 'Raw data Liver',  x = 'Log-cpm')

lcpm.filtered.counts.L <- cpm(dge.liver, log=TRUE, normalized.lib.sizes = FALSE)
lcpm.filtered.counts.plot.L <- melt(as.data.frame(lcpm.filtered.counts.L))  %>%
  rename("Sample" = variable, "Expression" = value) 
## No id variables; using all as measure variables
ggplot(lcpm.filtered.counts.plot.L, aes(x = Expression, colour = Sample)) + 
  geom_line(stat = 'density')+
  labs(title = 'Filtered data Liver',  x = 'Log-cpm')

#kupffer
lcpm.raw.counts.K <- cpm(kupffer.raw, log=TRUE, normalized.lib.sizes = FALSE)
lcpm.raw.counts.plot.K <- melt(as.data.frame(lcpm.raw.counts.K)) %>% 
  rename("Sample" = variable, "Expression" = value) 
## No id variables; using all as measure variables
ggplot(lcpm.raw.counts.plot.K, aes(x = Expression, colour = Sample)) + 
  geom_line(stat = 'density')+
  labs(title = 'Raw data Kupffer',  x = 'Log-cpm')

lcpm.filtered.counts.K <- cpm(dge.kupffer, log=TRUE, normalized.lib.sizes = FALSE)
lcpm.filtered.counts.plot.K <- melt(as.data.frame(lcpm.filtered.counts.K))  %>%
  rename("Sample" = variable, "Expression" = value) 
## No id variables; using all as measure variables
ggplot(lcpm.filtered.counts.plot.K, aes(x = Expression, colour = Sample)) + 
  geom_line(stat = 'density')+
  labs(title = 'Filtered data Kupffer',  x = 'Log-cpm')

Normalisation

Normalisation is a process designed to identify and remove systematic technical differences between samples that occur in the data to ensure that technical bias has minimal impact on the results.

Trimmed Mean of M-values (TMM) (Robinson and Oshlack 2010)

Using the calcNormFactors function from the edgeR library, we normalise our data using TMM normalisation:

# Keep copy of non-normalised data
dge.liver.before.norm <- dge.liver
dge.kupffer.before.norm <- dge.kupffer

# Perform TMM normalisation
dge.liver <- calcNormFactors(dge.liver, method = "TMM")
dge.liver$samples['sample_id'] <- rownames(dge.liver$samples)
dge.liver$samples$norm.factors
## [1] 0.9535835 1.1600941 1.0083166 0.9689939 1.0510899 0.9324496 0.9579282
## [8] 1.1242418 0.8765408
dge.kupffer <- calcNormFactors(dge.kupffer, method = "TMM")
dge.kupffer$samples['sample_id'] <- rownames(dge.kupffer$samples)
dge.kupffer$samples$norm.factors
## [1] 1.2777421 1.1604493 1.1534283 0.6666004 0.8140009 0.8333141 1.0583055
## [8] 1.0758756 1.1357107
#save preprocessed DGElist objects
saveRDS(dge.liver, file = "Preprocessed_DGEList_GSE123661_Liver.rds")
saveRDS(dge.kupffer, file = "Preprocessed_DGEList_GSE123661_Kupffer.rds")

Have a look at the effect of normalisation by making boxplots of log cpm values before and after normalisation.

#kupffer
lcpm.K <- edgeR::cpm(dge.kupffer.before.norm, log=TRUE)
lcpm.plot.K <- melt(as.data.frame(lcpm.K)) %>% rename("Sample" = variable,
"Expression" = value)
## No id variables; using all as measure variables
ggplot(lcpm.plot.K, aes(y= Expression, colour = Sample)) + geom_boxplot() +
labs(title="Kupffer: CPM with Unnormalised LS",y="Log-cpm")

lcpm.norm.K <- edgeR::cpm(dge.kupffer, log=TRUE)
lcpm.norm.plot.K <- melt(as.data.frame(lcpm.norm.K)) %>%
rename("Sample" = variable, "Expression" = value)
## No id variables; using all as measure variables
ggplot(lcpm.norm.plot.K, aes(y= Expression, colour = Sample)) + geom_boxplot() +
labs(title="Kupffer: CPM with Normalised LS",y="Log-cpm")

#liver
lcpm.L <- edgeR::cpm(dge.liver.before.norm, log=TRUE)
lcpm.plot.L <- melt(as.data.frame(lcpm.L)) %>% rename("Sample" = variable,
"Expression" = value)
## No id variables; using all as measure variables
ggplot(lcpm.plot.L, aes(y= Expression, colour = Sample)) + geom_boxplot() +
labs(title="Liver: CPM with Unnormalised LS",y="Log-cpm")

lcpm.norm.L <- edgeR::cpm(dge.liver, log=TRUE)
lcpm.norm.plot.L <- melt(as.data.frame(lcpm.norm.L)) %>%
rename("Sample" = variable, "Expression" = value)
## No id variables; using all as measure variables
ggplot(lcpm.norm.plot.L, aes(y= Expression, colour = Sample)) + geom_boxplot() +
labs(title="Liver: CPM with Normalised LS",y="Log-cpm")

There is only a minor effect

The effective library size can be obtained by multiplying the library size with additional normalisation factors. Note that for the most count-based statistical testing procedures, the normalization factors are used directly in the statistical models. In other words, in most cases you can simply supply methods such as EdgeR with the raw count data and the scaling factors.

effSize.liver = dge.liver$samples$lib.size * dge.liver$samples$norm.factors
effSize.kupffer = dge.kupffer$samples$lib.size * dge.kupffer$samples$norm.factors

Dimensionality reduction

Genomic datasets are often high-dimensional. Dimensionality reduction analyses provide a way of reducing the complex dataset into a lower dimension (that is more accessible for interpretation) while attempting to retain as much relevant information as possible.

MDS

The plotMDS function performs dimensionality reduction and the distances between the gene expression profiles are visualised. The distances correspond to the leading log-fold-change (if the input data is in the log scale), determined as the root mean square (Euclidean) average of the largest log2-fold changes between each pair of samples (the standard option looks at the top 500 genes with the largest log-fold change for each pair).

# The underlying MDS function is the cmdscale function
# We need the log transformed lcpm counts
mds.liver <- plotMDS(lcpm.norm.L, dim = c(1,8), gene.selection = 'pairwise', plot = FALSE)
mds.kupffer <- plotMDS(lcpm.norm.K, dim = c(1,8), gene.selection = 'pairwise', plot = FALSE)
ggplotMDS <- function(mds, metadata, param, dim1, dim2, label){
  
  # Create MDS plot with ggplot
  
  # Input: mds object
  #       annotation dataframe
  #       parameter of interest in the annotation dataframe 
  #       first dimension of interest
  #       second dimension of interest
  #       how you want the samples to be labeled

  # Extract cmdscale from mds object
  dataMDS <- as.data.frame(mds$cmdscale.out)
  
  i <-  metadata[colnames(metadata) == param]
  l <-  metadata[colnames(metadata) == label]
  
  plot <- ggplot(data = dataMDS, aes(x = dataMDS[,dim1] , 
                                     y = dataMDS[,dim2])) +
    geom_point(aes(colour = as.factor(i[,1])), size = 2)+
    labs(x = paste0("Dimension ", dim1), y = paste0("Dimension ", dim2), 
         colour = param) +
    geom_text(aes(label = l[,1], colour = as.factor(i[,1])), 
              size = 2.8, vjust = -0.5) +
    theme(axis.title.x=element_text(size=11),
          axis.title.y=element_text(size=11),
          legend.text = element_text(size=10), 
          legend.title = element_text(size = 11, face = "bold"))
  
return(plot)
}
#liver
ggplotMDS(mds.liver, as.data.frame(dge.liver$samples), 'group', 1, 2, 'sample_id')

ggplotMDS(mds.liver, as.data.frame(dge.liver$samples), 'group', 1, 3, 'sample_id')

ggplotMDS(mds.liver, as.data.frame(dge.liver$samples), 'group', 3, 4, 'sample_id')

#kupffer (by looking at MDS plots differences are less pronounced here)
ggplotMDS(mds.kupffer, as.data.frame(dge.kupffer$samples), 'group', 1, 2, 'sample_id')

ggplotMDS(mds.kupffer, as.data.frame(dge.kupffer$samples), 'group', 1, 3, 'sample_id')

ggplotMDS(mds.kupffer, as.data.frame(dge.kupffer$samples), 'group', 3, 4, 'sample_id')

An interactive MDS plot can be created with the glimma package.

#liver
#glMDSPlot(lcpm.norm.L,labels=paste(dge.liver$samples$group,sep="_"), groups=dge.liver$samples[,c(2,5)],launch=T)
#kupffer
#glMDSPlot(lcpm.norm.K,labels=paste(dge.kupffer$samples$group,sep="_"), groups=dge.kupffer$samples[,c(2,5)],launch=T)

We can see that the samples from the kupffer cells are more spread out and there is less separation between the control and case samples. So we expect less significant differential expression compared to the liver tissue samples.

#differential expression using EdgeR

Load the Data (if needed)

dge.liver <- readRDS('Preprocessed_DGEList_GSE123661_Liver.rds')
dge.kupffer <- readRDS('Preprocessed_DGEList_GSE123661_Kupffer.rds')

The goal: compare two cell populations (cirrhosis vs control) –> Are there any genes differentially expressed between the different cell populations?

Specify the Design Matrix

#liver
dge.liver$samples$group <- as.factor(dge.liver$samples$group)
design_liver <- model.matrix(~0+group, data = dge.liver$samples)
colnames(design_liver) <- levels(dge.liver$samples$group)
design_liver
##         Cirrhosis Control
## L-MFK11         0       1
## L-MFK12         0       1
## L-MFK13         0       1
## L-MFK14         0       1
## L-MFK15         1       0
## L-MFK16         1       0
## L-MFK18         1       0
## L-MFK8          1       0
## L-MFK9          0       1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
#kupffer
dge.kupffer$samples$group <- as.factor(dge.kupffer$samples$group)
design_kupffer <- model.matrix(~0+group, data = dge.kupffer$samples)
colnames(design_kupffer) <- levels(dge.kupffer$samples$group)
design_kupffer
##             Cirrhosis Control
## K-MFK11             0       1
## K-MFK12             0       1
## K-MFK13             0       1
## K-MFK14             0       1
## K-MFK15             1       0
## K-MFK16             1       0
## K-MFK18             1       0
## K-MFK8              1       0
## K-MFK9_2016         0       1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Estimate Dispersion and Fit Model

We fit genewise generalized linear models.

dge.liver <- estimateDisp(dge.liver, design = design_liver)
names(dge.liver)
##  [1] "samples"            "counts"             "design"            
##  [4] "common.dispersion"  "trended.dispersion" "tagwise.dispersion"
##  [7] "AveLogCPM"          "trend.method"       "prior.df"          
## [10] "prior.n"            "span"
dge.kupffer <- estimateDisp(dge.kupffer, design = design_kupffer)

###BCV plot

Create a BCV plot to visualise the dispersion estimates.

plotBCV(dge.liver)

plotBCV(dge.kupffer)

Fit the model

fit.kupffer <- glmQLFit(dge.kupffer, design_kupffer)
fit.liver <- glmQLFit(dge.liver, design_liver)

names(fit.liver)
##  [1] "coefficients"          "fitted.values"         "deviance"             
##  [4] "method"                "counts"                "unshrunk.coefficients"
##  [7] "df.residual"           "design"                "offset"               
## [10] "dispersion"            "prior.count"           "AveLogCPM"            
## [13] "df.residual.zeros"     "df.prior"              "var.post"             
## [16] "var.prior"             "samples"

QLD

Plot the QL dispersions.

plotQLDisp(fit.liver) # Plot the quasi-likelihood dispersion 

plotQLDisp(fit.kupffer)

Test Differential Expression using glmQLFit

Make a contrast matrix to specify comparisons:

cont.matrix_L <- makeContrasts("Cirrhosis-Control", levels=design_liver) 
cont.matrix_K <- makeContrasts("Cirrhosis-Control", levels=design_kupffer) 

Conduct quasi-likelihood tests to assess differential expression :

qlf.kupffer <- glmQLFTest(fit.kupffer, contrast = cont.matrix_K[ ,1])
qlf.liver <- glmQLFTest(fit.liver, contrast = cont.matrix_L[ , 1]) # 

Determine how many genes have an FDR-rate lower than 0.01?

# Number of genes FDR < 0.01

#liver
top.qlf.liver <- topTags(qlf.liver, n=Inf)
sum(top.qlf.liver$table$FDR<0.01) #34 rather low for GSA (ideally 100-500 DE genes), for GSA use genes with FDR >0.05 -> 405 genes
## [1] 34
head(top.qlf.liver)
## Coefficient:  1*Cirrhosis -1*Control 
##             logFC   logCPM        F       PValue         FDR
## SH3RF2   2.787635 5.261723 77.00470 6.966026e-07 0.003157668
## C5orf47 -3.702139 2.693989 76.93127 7.004097e-07 0.003157668
## SNTG2   -2.336456 3.668089 76.55878 7.200962e-07 0.003157668
## ARMC12   3.086569 3.735648 73.02409 9.423416e-07 0.003157668
## L1CAM   -3.275196 5.481313 71.65478 1.048989e-06 0.003157668
## TOX3    -3.303523 2.543762 64.36499 1.915275e-06 0.004103926
#kupffer
top.qlf.kupffer <- topTags(qlf.kupffer, n=Inf)
sum(top.qlf.kupffer$table$FDR<0.01) # <0.01 -> 0 genes (Top gene FDR = 0.368) already expected less DE genes compared to the liver cells when we saw the  MDS plots
## [1] 0
head(top.qlf.kupffer)
## Coefficient:  1*Cirrhosis -1*Control 
##              logFC   logCPM        F       PValue       FDR
## BMX      -3.863557 2.689451 41.34224 2.167126e-05 0.3683247
## SLC22A23  2.922988 4.409894 30.80981 9.148720e-05 0.4248891
## SMPDL3A   3.040298 3.206437 27.91148 1.448644e-04 0.4248891
## ALDH1A2   2.839239 3.880462 27.84480 1.464619e-04 0.4248891
## NMB       2.466164 3.344851 26.68885 1.776651e-04 0.4248891
## TFF3     -2.782391 0.522475 25.53857 2.165861e-04 0.4248891

conclusion: The liver tissue has 34 significantly DE genes with an FDR < 0.01 –> with GSA we can find out which pathways are over represented. We will use DE genes with an FDR < 0.05 to include enough DE genes (405 genes). For GSA this higher FDR will not be a problem. Kupffer cells show no DE genes (top gene FDR = 0.368). We already expected less DE genes compared to the liver cells when we saw the MDS plots. From here we only work further with the liver tissue cells.

Accounting for confounders

The phenodata didn’t mention any confounders and their research is not available. So unfortunately we could not take into account the effect of confounders like age, gender, alcohol use, etc. on the relation between disease and gene expression.

#save EdgeR DE-output for next step:

#save data needed for GSA as text
write.table(top.qlf.liver$table,"EdgeR_output_Liver.txt",sep="\t",row.names=T)
write.table(top.qlf.kupffer$table,"EdgeR_output_Kupffer.txt",sep="\t",row.names=T) #actually not needed

Gene set analysis

“overrepresentation analysis”: certain GO-term (Gene Ontology term) or a certain pathway is overrepresented amongst our set of interest (e.g. differentially expressed genes) compared to the background. The selection of both this set of interest and the background is of vital importance. Note that selecting a faulty background can already introduce (some) overrepresentation.

To test for overrepresentation, for each term of interest (GO/Pathway) the number of genes in our set of interest and in the background is counted and then tested (typically by chi-squared/Fisher exact test) whether the occurence in the set of interest is significantly different from the occurence in the background. We then correct for multiple testing based on the number of gene sets we analyse.

Load in the data (output DE by EdgeR,)

res.liver <- read.table("./EdgeR_output_Liver.txt", header = T, row.names = NULL, quote="", sep="\t")
colnames(res.liver) <- c("Gene.symbol","logFC","logCPM","F","Pvalue","FDR")

A FDR cut off of 0.01 gives 34 DE genes. Therefore we chose a FDR cut off of 0.05 which will include 405 DE genes in our GSA set. Once we have have our background genes and gene set of interest saved as a textfile, we can use the webtool Webgestalt to perform GSA.

# Select Gene symbols for the background (i.e. all expressed genes)
Gene.symbol.background <- res.liver$Gene.symbol
Gene.symbol.background <- gsub("\"","",Gene.symbol.background)

write.table(Gene.symbol.background, file="Gene_symbol_background.txt", 
            col.names=F, quote=F, row.names=F)

# Select Gene symbols for the set of interest
Gene.symbol.sign <- res.liver$Gene.symbol[(res.liver$FDR<0.05)]
Gene.symbol.sign <- Gene.symbol.sign[!is.na(Gene.symbol.sign)]
Gene.symbol.sign <- sort(as.character(Gene.symbol.sign))
length(Gene.symbol.sign)
## [1] 405
Gene.symbol.sign <- gsub("\"","",Gene.symbol.sign)

write.table(Gene.symbol.sign, file="Gene_symbol_sign.txt", 
            col.names=F, quote=F, row.names=F)

Now that we have have our background genes and gene set of interest saved as a textfile, we can use the webtool Webgestalt to perform GSA. The geneontology database was selected to find out which biological processes are overrepresented in our gene set of interest.

htmltools::includeHTML('Report_wg_result1620406440.html')
WebGestalt (WEB-based GEne SeT AnaLysis Toolkit)

WEB-based GEne SeT AnaLysis Toolkit

Translating gene lists into biological insights...


Summary

Job summary

  • Enrichment method: ORA
  • Organism: hsapiens
  • Enrichment Categories: geneontology_Biological_Process
  • Interesting list: Gene_symbol_sign_1620406440.txt. ID type: genesymbol
  • The interesting list contains 405 user IDs in which 379 user IDs are unambiguously mapped to 379 unique entrezgene IDs and 26 user IDs can not be mapped to any entrezgene ID.
  • The GO Slim summary are based upon the 379 unique entrezgene IDs.
  • Among 379 unique entrezgene IDs, 307 IDs are annotated to the selected functional categories and also in the reference list, which are used for the enrichment analysis.
  • Reference list: uploads/Gene_symbol_background_1620406440.txt ID type: genesymbol
  • The reference list can be mapped to 13913 entrezgene IDs and 11918 IDs are annotated to the selected functional categories that are used as the reference for the enrichment analysis.

Parameters for the enrichment analysis:

  • Minimum number of IDs in the category: 5
  • Maximum number of IDs in the category: 2000
  • FDR Method: BH
  • Significance Level: Top 10

Based on the above parameters, 10 categories are identified as enriched categories and all are shown in this report.

GO Slim summary for the user uploaded IDs

Each Biological Process, Cellular Component and Molecular Function category is represented by a red, blue and green bar, repectively.

The height of the bar represents the number of IDs in the user list and also in the category.

Bar charts of enriched GO Slim terms

Enrichment Results

Redundancy reduction: All Affinity propagation Weighted set cover